Abstract



We consider online planning in Markov decision processes (MDPs). In online planning,
the agent focuses on its current state only, deliberates about the set of possible policies from 
that state onwards and, when interrupted, uses the outcome of that exploratory deliberation 
to choose what action to perform next. The performance of algorithms for online planning 
is assessed in terms of simple regret, which is the agent's expected performance loss when 
the chosen action, rather than an optimal one, is followed. 

To date, state-of-the-art algorithms for online planning in general MDPs are either
best effort, or guarantee only polynomial-rate reduction of simple regret over time. Here
we introduce a new Monte-Carlo tree search algorithm, BRUE, that guarantees exponential-rate
reduction of simple regret and error probability. This algorithm is based on a simple
yet non-standard state-space sampling scheme, MCTS2e, in which different parts of each
sample are dedicated to different exploratory objectives. Our empirical evaluation shows
that BRUE not only provides superior performance guarantees, but is also very effective in
practice and compares favorably to the state of the art. We then extend BRUE with a variant
of "learning by forgetting." The resulting set of algorithms, BRUE($\alpha$), generalizes BRUE,
improves the exponential factor in the upper bound on its reduction rate, and exhibits even
more attractive empirical performance.

1. Introduction 

Markov decision processes (MDPs) are a standard model for planning under uncertainty (Puterman,
1994). An MDP $(S, A, Tr, R)$ is defined by a set of possible agent states $S$, a set of
agent actions $A$, a stochastic transition function $Tr : S \times A \times S \to [0, 1]$, and a reward
function $R : S \times A \times S \to \mathbb{R}$. Depending on the problem domain and the representation language,
the description of the MDP can be either declarative or generative (or mixed). In any case, 
the description of the MDP is assumed to be concise. While declarative models provide 
the agents with greater algorithmic flexibility, generative models are more expressive, and 
both types of models allow for simulated execution of all feasible action sequences, from 
any state of the MDP. The current state of the agent is fully observable, and the objective 
of the agent is to act so as to maximize its accumulated reward. In the finite horizon setting
that will be used for most of the paper, the reward is accumulated over some predefined 
number of steps H. 

The desire to handle MDPs with state spaces of size exponential in the size of the model 
description has led researchers to consider online planning in MDPs. In online planning,
the agent, rather than computing a quality policy for the entire MDP before taking any
action, focuses only on what action to perform next. The decision process consists of a 
deliberation phase, aka planning, terminated either according to a predefined schedule or 
due to an external interrupt, and followed by a recommended action for the current state. 
Once that action is applied in the real environment, the decision process is repeated from 
the obtained state to select the next action and so on. 

The quality of the action $a$, recommended for state $s$ with $H$ steps-to-go, is assessed in
terms of the probability that $a$ is sub-optimal, and in terms of the (closely related) measure
of simple regret $\Delta_H[s, a]$. The latter captures the performance loss that results from taking
$a$ and then following an optimal policy $\pi^*$ for the remaining $H - 1$ steps, instead of following
$\pi^*$ from the beginning (Bubeck & Munos, 2010). That is,

$$\Delta_H[s, a] = Q_H(s, \pi^*(s, H)) - Q_H(s, a),$$

where

$$Q_H(s, a) = E_{s'}\left[R(s, a, s') + Q_{H-1}(s', \pi^*(s', H - 1))\right].$$
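
To make the definition concrete, the following minimal sketch computes $Q_H$ by backward induction and reports the simple regret of each root action. The tabular encoding (`Transitions`), the helper names, and the toy MDP are assumptions made for illustration only; they are not part of the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical tabular encoding: transitions[s][a] is a list of
# (probability, next_state, reward) triples; a state with no actions is a sink.
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]


def q_value(tr: Transitions, s: str, a: str, h: int) -> float:
    """Q_h(s, a): take a at s, then act optimally for the remaining h - 1 steps."""
    return sum(p * (r + state_value(tr, s2, h - 1)) for p, s2, r in tr[s][a])


def state_value(tr: Transitions, s: str, h: int) -> float:
    """Optimal value with h steps-to-go; zero at the horizon and in sink states."""
    if h == 0 or not tr.get(s):
        return 0.0
    return max(q_value(tr, s, a, h) for a in tr[s])


def simple_regret(tr: Transitions, s: str, a: str, h: int) -> float:
    """Delta_h[s, a] = Q_h(s, pi*(s, h)) - Q_h(s, a)."""
    return state_value(tr, s, h) - q_value(tr, s, a, h)


# Toy MDP (illustrative): "safe" is optimal at s0 with two steps-to-go.
toy: Transitions = {
    "s0": {
        "safe": [(1.0, "s1", 0.5)],
        "risky": [(0.5, "s1", 1.0), (0.5, "s2", 0.0)],
    },
    "s1": {"go": [(1.0, "s2", 1.0)]},
    "s2": {},
}

for action in toy["s0"]:
    print(action, simple_regret(toy, "s0", action, 2))
```

Exact backward induction is feasible only for tiny models; it is used here purely to illustrate the quantity that online planners try to drive to zero.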

With a few recent exceptions developed for declarative MDPs (Bonet & Geffner, 2012;
Kolobov, Mausam, & Weld, 2012; Busoniu & Munos, 2012), most algorithms for online 
MDP planning constitute variants of what is called Monte-Carlo tree search (MCTS). One 
of the earliest and best-known MCTS algorithms for MDPs is the sparse sampling algorithm
by Kearns, Mansour, and Ng (1999). Sparse sampling offers a near-optimal
action selection in discounted MDPs by constructing a sampled lookahead tree in
time exponential in discount factor and suboptimality bound, but independent of the state 
space size. However, if terminated before an action has proved to be near-optimal, sparse 
sampling offers no quality guarantees on its action selection. Thus it does not really fit 
the setup of online planning. Several later works introduced interruptible, anytime MCTS 
algorithms for MDPs, with UCT (Kocsis & Szepesvari, 2006) probably being the most 
widely used such algorithm these days. Anytime MCTS algorithms are designed to provide 
convergence to the best action if enough time is given for deliberation, as well as a gradual 
reduction of performance loss over the deliberation time (Sutton & Barto, 1998; Peret & 
Garcia, 2004; Kocsis & Szepesvari, 2006; Coquelin & Munos, 2007; Cazenave, 2009; Rosin, 
2011; Tolpin & Shimony, 2012). While UCT and its successors have been devised specifically 
for MDPs, some of these algorithms are also successfully used in partially observable and 
adversarial settings (Gelly & Silver, 2011; Sturtevant, 2008; Bjarnason, Fern, & Tadepalli, 
2009; Balla & Fern, 2009; Eyerich, Keller, & Helmert, 2010). 

In general, the relative empirical attractiveness of the various MCTS planning algorithms 
depends on the specifics of the problem at hand and cannot usually be predicted ahead of 
time. When it comes to formal guarantees on the expected performance improvement over 
the planning time, very few of these algorithms provide such guarantees for general MDPs, 
and none breaks the barrier of the worst-case only polynomial-rate reduction of simple regret 
and choice-error probability over time. 

This is precisely our contribution here. We introduce a new Monte-Carlo tree search 
algorithm, BRUE, that guarantees exponential-rate reduction of both simple regret and 
choice-error probability over time, for general MDPs over finite state spaces. The algorithm 
is based on a simple and efficiently implementable sampling scheme, MCTS2e, in which 
different parts of each sample are dedicated to different competing exploratory objectives.
The motivation for this objective decoupling came from a recently growing understanding
that the current MCTS algorithms for MDPs do not optimize the reduction of simple regret
directly, but only via optimizing what is called cumulative regret, a performance measure
suitable for the (very different) setting of reinforcement learning (Bubeck & Munos, 2010;
Busoniu & Munos, 2012; Tolpin & Shimony, 2012; Feldman & Domshlak, 2012). Our
empirical evaluation on some standard MDP benchmarks for comparison between MCTS
planning algorithms shows that BRUE not only provides superior performance guarantees,
but is also very effective in practice and compares favorably to the state of the art. We then
extend BRUE with a variant of "learning by forgetting." The resulting family of algorithms,
BRUE($\alpha$), generalizes BRUE, improves the exponential factor in the upper bound on its
reduction rate, and exhibits even more attractive empirical performance.

MCTS: [input: (S, A, Tr, R); s_0 ∈ S]
    search tree T ← root node s_0
    while time permits:
        ρ ← sample(s_0, T)
        T ← expand-tree(T, ρ)
        update-statistics(T, ρ)
    return recommend-action(s_0, T)

Figure 1: High-level scheme for regular Monte-Carlo tree sampling.

2. Monte-Carlo Planning 

MCTS, a high-level scheme for Monte-Carlo tree search that gives rise to various specific
algorithms for online MDP planning, is depicted in Figure 1. Starting with the current state
$s_0$, MCTS performs an iterative construction of a tree $T$ rooted at $s_0$. At each iteration,
MCTS issues a state-space sample from $s_0$, expands the tree $T$ using the outcome of that
sample, and updates information stored at the nodes of $T$. Once the simulation phase is
over, MCTS uses the information collected at the nodes of $T$ to recommend an action to
perform in $s_0$. For compatibility of the notation with prior literature, in what follows we
refer to the tree nodes via the states associated with these nodes. Note that, due to the
Markovian nature of MDPs, it is unreasonable to distinguish between nodes associated with
the same state at the same depth. Hence, the actual graph constructed by most instances
of MCTS forms a DAG over nodes $(s, h) \in S \times \{0, 1, \dots, H\}$. By $A(s) \subseteq A$ in what follows,
we refer to the subset of actions applicable in state $s$.

Numerous concrete instances of MCTS have been proposed, with UCT (Kocsis & Szepesvari,
2006) probably being the most popular such algorithm these days (Gelly & Silver, 2011;
Sturtevant, 2008; Bjarnason et al., 2009; Balla & Fern, 2009; Eyerich et al., 2010; Keller
& Eyerich, 2012a). To give a concrete sense of MCTS's components, as well as to ground
some intuitions discussed later on, below we describe the specific setting of MCTS
corresponding to the core UCT algorithm, and Figure 2 illustrates the UCT tree construction,
with $n$ denoting the number of state-space samples.

• sample: The samples $\rho = (s_0, a_1, s_1, \dots, a_k, s_k)$ are all issued from the root node $s_0$.
The sample ends either when a sink state is reached, that is, $A(s_k) = \emptyset$, or when
$k = H$. Each node/action pair $(s, a)$ is associated with a counter $n(s, a)$ and a value
accumulator $\hat{Q}(s, a)$. Both $n(s, a)$ and $\hat{Q}(s, a)$ are initialized to 0, and then updated
by the update-statistics procedure. Given $s_i$, the next-on-the-sample action $a_{i+1}$ is
selected according to the deterministic UCB1 policy (Auer, Cesa-Bianchi, & Fischer,
2002a), originally proposed for optimal cumulative regret minimization in stochastic
multi-armed bandit (MAB) problems (Robbins, 1952): If $n(s_i, a) > 0$ for all $a \in A(s_i)$,
then

$$a_{i+1} = \operatorname*{argmax}_{a \in A(s_i)} \left[ \hat{Q}(s_i, a) + c \sqrt{\frac{\log n(s_i)}{n(s_i, a)}} \right], \tag{1}$$

where $n(s) = \sum_a n(s, a)$ and $c > 0$ is an exploration constant. Otherwise, $a_{i+1}$ is selected
uniformly at random from the still unexplored actions $\{a \in A(s_i) \mid n(s_i, a) = 0\}$. In both
cases, $s_{i+1}$ is then sampled according to the conditional probability $P(S' \mid s_i, a_{i+1})$,
induced by the transition function $Tr$.

• expand-tree: Each state-space sample $\rho = (s_0, a_1, s_1, \dots, a_k, s_k)$ induces a state trace
$(s_0, s_1, \dots, s_i)$ inside $T$, as well as a state trace $(s_{i+1}, \dots, s_k)$ outside of $T$. In principle,
$T$ can be expanded with any prefix of $(s_{i+1}, \dots, s_k)$; a popular choice in prior work
appears to be expanding $T$ with only the upper-most node $s_{i+1}$. (If $T$ is constructed
as a DAG, it is expanded with the first node along $\rho$ that leaves $T$.)

• update-statistics: For each node $s_i$ along $\rho$ that is now part of the expanded tree $T$,
the counter $n(s_i, a_{i+1})$ is incremented and the estimated Q-value is updated as

$$\hat{Q}(s_i, a_{i+1}) \leftarrow \hat{Q}(s_i, a_{i+1}) + \frac{R_i - \hat{Q}(s_i, a_{i+1})}{n(s_i, a_{i+1})}, \tag{2}$$

where $R_i = \sum_{j=i}^{k-1} R(s_j, a_{j+1}, s_{j+1})$.

• recommend-action: Interestingly, the action recommendation protocol of UCT was
never properly specified, and different applications of UCT adopt different decision
rules, including maximization of the estimated Q-value, of the augmented estimated
Q-value as in Eq. 1, of the number of times the action was selected during the
simulation, as well as randomized protocols based on the information collected at the
root.

[Figure 2: Illustration of the UCT dynamics. Panels show the tree after n = 1, ..., 5 and n = 50 samples, with sample segments labeled UCB and Uniform.]

The key property of UCT is that its exploration of the search space is obtained by
considering a hierarchy of forecasters, each minimizing its own cumulative regret, that is,
the loss of the total reward incurred by exploring the environment (Auer et al., 2002a).
Each such pseudo-agent forecaster corresponds to a state/steps-to-go pair $(s, h)$. In that
respect, according to Theorem 6 of Kocsis and Szepesvari (2006), UCT asymptotically
achieves the best possible (logarithmic) cumulative regret. However, as recently pointed out
in numerous works (Bubeck & Munos, 2010; Busoniu & Munos, 2012; Tolpin & Shimony,
2012; Feldman & Domshlak, 2012), cumulative regret does not seem to be the right objective
for online MDP planning, and this is because the rewards "collected" at the simulation
phase are fictitious. Furthermore, the work of Bubeck, Munos, and Stoltz (2011) on
multi-armed bandits shows that minimizing cumulative regret and minimizing simple regret are
somewhat competing objectives. Indeed, the same Theorem 6 of Kocsis and Szepesvari
(2006) claims only a polynomial-rate reduction of the probability of choosing a non-optimal
action, and the results of Bubeck et al. (2011) on simple regret minimization in MABs with
stochastic rewards imply that UCT achieves only polynomial-rate reduction of the simple
regret over time. Some attempts have recently been made to adapt UCT, and MCTS-based
planning in general, to optimizing simple regret in online MDP planning directly, and some
of these attempts were empirically rather successful (Tolpin & Shimony, 2012; Hay, Shimony,
Tolpin, & Russell, 2012). However, to the best of our knowledge, none of them breaks UCT's
barrier of the worst-case polynomial-rate reduction of simple regret over time.
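
The UCB1 action selection and the running-average update used by UCT can be sketched as follows. This is a minimal illustration rather than the reference implementation: the `NodeStats` container, the function names, and the default exploration constant `c=1.0` are assumptions introduced here.

```python
import math
import random
from collections import defaultdict


class NodeStats:
    """Counters n(s, a) and value accumulators Q(s, a), both starting at 0."""

    def __init__(self):
        self.n = defaultdict(int)
        self.q = defaultdict(float)


def ucb1_select(stats, state, actions, c=1.0, rng=random):
    """Pick an untried action uniformly if one exists; otherwise maximize
    the estimated value plus the UCB1 exploration bonus."""
    untried = [a for a in actions if stats.n[(state, a)] == 0]
    if untried:
        return rng.choice(untried)
    total = sum(stats.n[(state, a)] for a in actions)  # n(s) = sum_a n(s, a)

    def score(a):
        bonus = c * math.sqrt(math.log(total) / stats.n[(state, a)])
        return stats.q[(state, a)] + bonus

    return max(actions, key=score)


def update(stats, state, action, reward_to_go):
    """Increment n(s, a) and move Q(s, a) toward the observed reward-to-go."""
    key = (state, action)
    stats.n[key] += 1
    stats.q[key] += (reward_to_go - stats.q[key]) / stats.n[key]


stats = NodeStats()
update(stats, "s", "a", 1.0)
update(stats, "s", "a", 0.0)
update(stats, "s", "b", 0.4)
print(ucb1_select(stats, "s", ["a", "b"]))
```

In this usage example the less-sampled action `b` wins despite its lower estimate, because its exploration bonus is larger. That is precisely the exploration/exploitation balancing that targets cumulative rather than simple regret.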

3. Simple Regret Minimization in MDPs 

We now show that exponential-rate reduction of simple regret in online MDP planning is 
achievable. To do so, we first motivate and introduce a family of MCTS algorithms with a 
two-phase scheme for generating state space samples, and then describe a concrete algorithm 
from this family, BRUE, that (1) guarantees that the probability of recommending a non-optimal
action asymptotically converges to zero at an exponential rate, and (2) achieves
exponential-rate reduction of simple regret over time.

3.1 Exploratory concerns in online MDP planning 

The work of Bubeck et al. (2011) on pure exploration in multi-armed bandit (MAB) problems
was probably the first to stress that the minimal simple regret can increase as the
bound on the cumulative regret decreases. At a high level, Bubeck et al. (2011) show
that efficient schemes for simple regret minimization in MAB should be as exploratory as 
possible, thus improving the expected quality of the recommendation issued at the end of 
the learning process. In particular, they showed that the simple round-robin sampling of 
MAB actions, followed by recommending the action with the highest empirical mean, yields 
exponential-rate reduction of simple regret, while the UCB1 strategy that balances between 
exploration and exploitation yields only polynomial-rate reduction of that measure. In that 
respect, the situation with MDPs is seemingly no different, and thus Monte-Carlo MDP 
planning should focus on exploration only. However, the answer to the question of what it 
means to be "as exploratory as possible" with MDPs is less straightforward than it is in 
the special case of MABs. 
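
The round-robin strategy for MABs discussed above can be sketched in a few lines. The arm interface (a callable that takes a random generator and returns a reward) and the Bernoulli arms in the usage example are assumptions made for illustration.

```python
import random


def round_robin_recommend(arms, n, rng):
    """Sample the arms in a fixed cyclic order for n pulls, then recommend
    the index of the arm with the highest empirical mean."""
    k = len(arms)
    totals = [0.0] * k
    counts = [0] * k
    for t in range(n):
        i = t % k  # pure exploration: no arm is favored while sampling
        totals[i] += arms[i](rng)
        counts[i] += 1
    means = [totals[i] / counts[i] if counts[i] else float("-inf") for i in range(k)]
    return max(range(k), key=means.__getitem__)


def bernoulli(p):
    """Hypothetical arm: reward 1 with probability p, otherwise 0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


arms = [bernoulli(0.2), bernoulli(0.8)]
print(round_robin_recommend(arms, 2000, random.Random(0)))
```

Sampling and recommendation are fully decoupled here: the sampling order ignores the observed rewards, and only the final recommendation uses them.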

For an intuition as to why the "pure exploration dilemma" in MDPs is somewhat
complicated, consider the state/steps-to-go pairs $(s, h)$ as pseudo-agents, all acting on behalf of
the root pseudo-agent $(s_0, H)$ that aims at minimizing its own simple regret in a stochastic
MAB induced by the applicable actions $A(s_0)$. Clearly, if an oracle would provide $(s_0, H)$
with an optimal action $\pi^*(s_0, H)$, then no further deliberation would be needed until after
the execution of $\pi^*(s_0, H)$. However, the task characteristics of $(s_0, H)$ are an exception
rather than a rule. Suppose that an oracle provides us with optimal actions for all
pseudo-agents $(s, h)$ but $(s_0, H)$. Despite the richness of this information, $(s_0, H)$ in some sense
remains as clueless as it was before: To choose between the actions in $A(s_0)$, $(s_0, H)$ needs,
at the very least, some ordinal information about the expected value of these alternatives.
Hence, when sampling the futures, each non-root pseudo-agent $(s, h)$ should be devoted to
two objectives:

(1) identifying an optimal action $\pi^*(s, h)$, and

(2) estimating the actual value of that action, because this information is needed by the
predecessor(s) of $(s, h)$ in $T$.

Note that both these objectives are exploratory, yet the problem is that they are
somewhat competing. In that respect, the choices made by UCT actually make sense: Each
sample $\rho$ issued by UCT at $(s, h)$ is a priori devoted both to increasing the confidence that
some current candidate $\hat{a}$ for $\pi^*(s, h)$ is indeed $\pi^*(s, h)$, and to improving the estimate
of $Q_h(s, \hat{a})$, as if assuming that $\pi^*(s, h) = \hat{a}$. However, while such an overloading of
the samples is unavoidable in the "learning while acting" setup of reinforcement learning,
this should not necessarily be the case in online planning. Moreover, this sample overloading
in UCT comes with a high price: As shown by Coquelin and Munos (2007), the
number of samples after which the bounds of UCT on both simple and cumulative regret
become meaningful might be as high as hyper-exponential in $H$.






3.2 Separation of Concerns at the Extreme 



Separating the two aforementioned exploratory concerns is at the focus of our investigation
here. Let $s_0$ be a state of an MDP $(S, A, Tr, R)$ with rewards in $[0, 1]$, $K$ applicable actions
at each state, $B$ possible outcome states for each action, and finite horizon $H$. First, to
get a sense of what separation of exploratory concerns in online planning can buy us, we
begin with a MAB perspective on MDPs, with each arm in the MAB corresponding to a
"flat" policy of acting for $H$ steps starting from the current state $s_0$. A "flat" policy $\pi$ is a
minimal partial mapping from state/steps-to-go pairs to actions that fully specifies an acting
strategy in the MDP for $H$ steps, starting at $s_0$. Sampling such an arm $\pi$ is straightforward
as $\pi$ prescribes precisely which action should be applied at every state that can possibly
be encountered along the execution of $\pi$. The reward of such an arm $\pi$ is stochastic, with
support $[0, H]$, and the number of arms in this schematic MAB is $K^{\sum_{i=0}^{H-1} B^i} \approx K^{B^H}$.
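
As a quick sanity check on the size of this schematic MAB, the sketch below counts the "flat" policies for given $K$, $B$, and $H$: a flat policy makes one of $K$ choices at each of the $\sum_{i=0}^{H-1} B^i$ state/steps-to-go pairs it can reach. The function name is an assumption introduced for illustration.

```python
def num_flat_policies(K: int, B: int, H: int) -> int:
    """Number of arms in the schematic MAB: K to the power of the number of
    decision points, where depth i contributes B**i reachable states."""
    decision_points = sum(B ** i for i in range(H))
    return K ** decision_points


# Even a tiny problem yields a large arm set, and growth is doubly exponential in H.
for H in (1, 2, 3, 4):
    print(H, num_flat_policies(K=2, B=2, H=H))
```

For $K = B = 2$ the count grows as 2, 8, 128, 32768 for $H = 1, \dots, 4$, which is why algorithms that reason about each arm explicitly are not practical.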

Now, consider a simple algorithm, NaiveUniform, which systematically samples each
"flat" policy in a loop, and updates the estimation of the corresponding arm with the
obtained reward. If stopped at iteration $n$, the algorithm recommends $\pi(s_0)$, where $\pi$ is
the arm/policy with best empirical value $\hat{\mu}_{\pi,n}$. By the iteration $n$ of this algorithm, each
arm will be sampled at least $\lfloor n / K^{B^H} \rfloor$ times. Therefore, using Hoeffding's inequality, the
probability that the chosen arm $\pi$ is sub-optimal in our MAB is bounded by

$$P\{\hat{\mu}_{\pi,n} > \hat{\mu}_{\pi^*,n}\} = P\{\hat{\mu}_{\pi,n} - \hat{\mu}_{\pi^*,n} - (-\Delta_\pi) > \Delta_\pi\} \le \exp\left(-\frac{\lfloor n / K^{B^H} \rfloor \Delta_\pi^2}{2H^2}\right), \tag{3}$$

where $\Delta_\pi = \mu_{\pi^*} - \mu_\pi$, and thus the expected simple regret can be bounded as

$$E r_n \le H K^{B^H} \exp\left(-\frac{\lfloor n / K^{B^H} \rfloor d^2}{2H^2}\right), \tag{4}$$

where $d$ denotes the smallest gap $\Delta_\pi$ over the sub-optimal arms.

Note that NaiveUniform uses each sample $\rho = (s_0, a_0, s_1, a_1, \dots, a_{H-1}, s_H)$ to update the
estimation of only a single policy $\pi$. However, recalling that arms in our MAB problem
are actually compound policies, the same sample can in principle be used to update the
estimates of all policies $\pi'$ that are consistent with $\rho$ in the sense that, for $0 \le i \le H - 1$,
$\pi'(s_i, H - i)$ is defined and $\pi'(s_i, H - i) = a_i$. The resulting algorithm,
CraftyUniform, generates samples by choosing the actions along them uniformly at random,
and uses the outcome of each sample to update all the policies consistent with it. Note
that sampling the arms in CraftyUniform cannot be done systematically as in NaiveUniform
because the set of policies updated at each iteration is stochastic.

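Since the sampling is uniform, the probability of any policy to be updated by the sample
issued at any iteration of CraftyUniform is $\frac{1}{K^H}$. For an arm $\pi'$, let $N_{\pi',n}$ denote the number
of samples issued at the $n$ iterations of CraftyUniform that are consistent with the policy $\pi'$.
The probability that $\pi$, the best empirical arm after $n$ iterations, is sub-optimal is bounded
by

$$P\{\hat{\mu}_{\pi,n} > \hat{\mu}_{\pi^*,n}\} \le P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2}\right\} + P\left\{\hat{\mu}_{\pi^*,n} - \mu_{\pi^*} \le -\frac{\Delta_\pi}{2}\right\}. \tag{5}$$

Each of the two terms on the right-hand side can be bounded as:

$$\begin{aligned}
P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2}\right\}
&\le P\left\{N_{\pi,n} \le \frac{n}{2K^H}\right\} + P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2} \,\middle|\, N_{\pi,n} > \frac{n}{2K^H}\right\} \\
&\le \underbrace{e^{-\frac{n}{2K^{2H}}}}_{(\dagger)} + \underbrace{e^{-\frac{n \Delta_\pi^2}{4K^H H^2}}}_{(\ddagger)} \\
&\le 2 e^{-\frac{n \Delta_\pi^2}{4K^{2H} H^2}},
\end{aligned} \tag{6}$$

where $(\dagger)$ and $(\ddagger)$ are by the Hoeffding inequality. In turn, similarly to Eq. 4, the simple
regret for CraftyUniform is bounded by

$$E r_n \le 4 H K^{B^H} e^{-\frac{n d^2}{4K^{2H} H^2}}, \tag{7}$$

with $d$ denoting, as in Eq. 4, the smallest gap $\Delta_\pi$ over the sub-optimal arms.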

Since $H$ is a trivial upper-bound on $E r_n$, the bound in Eq. 7 becomes effective only when
$4 K^{B^H} \exp\left(-\frac{n d^2}{4K^{2H} H^2}\right) < 1$, that is, for

$$n > (K^2 B)^H \cdot \frac{4H^2}{d^2} \log K. \tag{8}$$

Note that this transition period length is still much better than that of UCT, which is
hyper-exponential in $H$. Moreover, unlike in UCT, the rate of the simple regret reduction
is then exponential in the number of iterations.
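
To get a feel for the length of this transition period, the sketch below solves the effectiveness condition for $n$. It assumes the condition has the form $4 K^{B^H} \exp(-n d^2 / (4 K^{2H} H^2)) < 1$, which gives the exact threshold $n > \frac{4 K^{2H} H^2}{d^2} (B^H \log K + \log 4)$, whose leading term is the expression in Eq. 8. The function name and the parameter values are illustrative assumptions.

```python
import math


def crafty_uniform_transition(K: int, B: int, H: int, d: float) -> float:
    """Smallest n (as a real number) at which
    4 * K**(B**H) * exp(-n * d**2 / (4 * K**(2*H) * H**2)) drops to 1.
    Computed in log-space to avoid forming K**(B**H) explicitly."""
    log_arms = (B ** H) * math.log(K)
    return 4 * K ** (2 * H) * H ** 2 / d ** 2 * (log_arms + math.log(4))


n_star = crafty_uniform_transition(K=2, B=2, H=2, d=0.5)
print(round(n_star, 1))
```

Already for $K = B = H = 2$ and $d = 0.5$ the threshold is in the thousands of samples, which illustrates that the exponential rate holds only after a substantial warm-up.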

3.3 Two-phase sampling and BRUE 

While both the simple regret convergence rate, as well as the length of the transition period
of CraftyUniform, are more attractive than those of UCT, this in itself is not much of a
help: CraftyUniform requires explicit reasoning about $K^{B^H}$ arms, and thus it cannot be
efficiently implemented. However, it does show the promise of separation of concerns in
online planning. We now introduce an MCTS family of algorithms, referred to as MCTS2e,
that allows utilizing this promise to a large extent.

The instances of the MCTS2e family vary along four parameters: switching point function
$\sigma : \mathbb{N} \to \{1, \dots, H\}$, exploration policy, estimation policy, and update policy. With
respect to these four parameters, the MCTS components in MCTS2e are as follows.

• Similarly to UCT, each node/action pair $(s, a)$ is associated with variables $n(s, a)$
and $\hat{Q}(s, a)$. However, while counters $n(s, a)$ are initialized to 0, value accumulators
$\hat{Q}(s, a)$ are schematically initialized to $-\infty$.

• sample: Each iteration of BRUE corresponds to a single state-space sample of the
MDP, and these samples $\rho = (s_0, a_1, s_1, \dots, a_k, s_k)$ are all issued from the root node
$s_0$. The sample ends either when a sink state is reached, that is, $A(s_k) = \emptyset$, or when
$k = H$. The generation of $\rho$ is done in two phases: At iteration $n$, the actions at states
$s_0, \dots, s_{\sigma(n)-1}$ are selected according to the exploration policy of the algorithm, while
the actions at states $s_{\sigma(n)}, \dots, s_{k-1}$ are selected according to its estimation policy.

• expand-tree: $T$ is expanded with the suffix of state sequence $s_1, \dots, s_{\sigma(n)-1}$ that is
new to $T$.

• update-statistics: For each state $s_i \in \{s_0, \dots, s_{\sigma(n)-1}\}$, the update policy of the
algorithm prescribes whether it should be updated. If $s_i$ should be updated, then the
counter $n(s_i, a_{i+1})$ is incremented and the estimated Q-value is updated according to
Eq. 2.

• recommend-action: The recommended action is chosen uniformly at random among
the actions $a$ maximizing $\hat{Q}(s_0, a)$.

In what follows, for $n > 0$, the $n$-th iteration of BRUE will be called an $h$-iteration if $\sigma(n) = h$.
At a high level, the two phases of sample generation respectively target the two exploratory
objectives of online MDP planning: While the sample prefixes aim at exploring the options,
the sample suffixes aim at improving the value estimates for the current candidates for $\pi^*$.
In particular, this separation allows us to introduce a specific MCTS2e instance, BRUE,¹
that is tailored to simple regret minimization. The BRUE setting of MCTS2e is described
below, and Figure 3 illustrates its dynamics.

• The switching point function $\sigma : \mathbb{N} \to \{1, \dots, H\}$ is

$$\sigma(n) = H - ((n - 1) \bmod H), \tag{9}$$

that is, the depth of exploration is chosen by a round-robin on $\{1, \dots, H\}$, in reverse
order.

• At state $s$, the exploration policy samples an action uniformly at random, while the
estimation policy samples an action uniformly at random, but only among the actions
$a \in A(s)$ that maximize $\hat{Q}(s, a)$.

• For a sample $\rho$ issued at iteration $n$, only the state/action pair $(s_{\sigma(n)-1}, a_{\sigma(n)})$
immediately preceding the switching state $s_{\sigma(n)}$ along $\rho$ is updated. That is, the information
obtained by the second phase of $\rho$ is used only for improving the estimate at state
$s_{\sigma(n)-1}$, and is not pushed further up the sample. While that may appear wasteful
and even counterintuitive, this locality of update is required to satisfy the formal
guarantees of BRUE discussed below.
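
The BRUE setting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Brue` class, the generative-model interface (`actions(s)` returning the applicable actions and `step(s, a, rng)` returning a next state and reward), and the indexing of nodes by (state, depth) are all hypothetical choices made here.

```python
import random
from collections import defaultdict


def switching_point(n: int, H: int) -> int:
    """sigma(n) = H - ((n - 1) mod H): round-robin over {1, ..., H}, in reverse."""
    return H - ((n - 1) % H)


class Brue:
    def __init__(self, actions, step, horizon, rng=None):
        self.actions = actions  # hypothetical: actions(s) -> list of actions
        self.step = step        # hypothetical: step(s, a, rng) -> (s', reward)
        self.H = horizon
        self.rng = rng or random.Random()
        self.n = defaultdict(int)
        self.q = {}             # a missing entry plays the role of -infinity

    def _estimate_action(self, s, depth):
        """Estimation policy: uniform among actions maximizing the estimate."""
        acts = self.actions(s)
        vals = [self.q.get((s, depth, a), float("-inf")) for a in acts]
        best = max(vals)
        return self.rng.choice([a for a, v in zip(acts, vals) if v == best])

    def iterate(self, s0, n):
        """Issue the n-th sample (n >= 1) and update a single node/action pair."""
        sigma = switching_point(n, self.H)
        s, key, tail = s0, None, 0.0
        for i in range(self.H):
            acts = self.actions(s)
            if not acts:        # sink state reached
                break
            if i < sigma:       # exploration phase: uniform choice
                a = self.rng.choice(acts)
            else:               # estimation phase: empirically best choice
                a = self._estimate_action(s, i)
            s_next, r = self.step(s, a, self.rng)
            if i == sigma - 1:  # the pair immediately preceding the switch
                key = (s, i, a)
            if i >= sigma - 1:  # reward accumulated from that pair onwards
                tail += r
            s = s_next
        if key is not None:     # local update only, as in Eq. 2
            self.n[key] += 1
            old = self.q.get(key, 0.0)
            self.q[key] = old + (tail - old) / self.n[key]

    def recommend(self, s0):
        return self._estimate_action(s0, 0)


# Usage on a toy deterministic two-step MDP (illustrative values).
moves = {("s0", "L"): ("s1", 0.0), ("s0", "R"): ("s2", 0.0),
         ("s1", "x"): ("t", 1.0), ("s1", "y"): ("t", 0.0),
         ("s2", "x"): ("t", 0.2)}
applicable = {"s0": ["L", "R"], "s1": ["x", "y"], "s2": ["x"]}
planner = Brue(lambda s: applicable.get(s, []),
               lambda s, a, rng: moves[(s, a)], horizon=2, rng=random.Random(0))
for it in range(1, 401):
    planner.iterate("s0", it)
print(planner.recommend("s0"))
```

Note the design choice that each sample updates exactly one estimate. Values sampled deeper in the tree never leak upward through an averaging backup; they influence the root only through the estimation policy that later samples follow.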

Before we proceed with the formal analysis of BRUE, a few comments on it, as well as
on the MCTS2e sampling scheme in general, are in order. First, the template of MCTS2e is
rather general, and some of its parametrizations will not even guarantee convergence to the
optimal action. This, for instance, will be the case with a (seemingly minor) modification
of BRUE to a purely uniform estimation policy. In short, MCTS2e should be parametrized
with care. Second, while in what follows we focus on BRUE, other instances of MCTS2e
may appear to be empirically effective as well with respect to the reduction of simple regret
over time. Some of them, similarly to BRUE, may also guarantee exponential-rate reduction
of simple regret over time. Hence, we clearly cannot, and do not, claim any uniqueness of
BRUE in that respect. Finally, some other families of MCTS algorithms, more sophisticated
than MCTS2e, can give rise to even more (formally and/or empirically) efficient optimizers of
simple regret. The BRUE($\alpha$) set of algorithms that we discuss later on is one such example.

[Figure 3: Illustration of the BRUE dynamics. Panels show the tree after n = 1, 2, 3, 10, 20, and 50 samples, with sample segments labeled Uniform and Empirical Best on either side of the switching point.]

1. Short for Best Recommendation with Uniform Exploration; the name is carried on from our first
presentation of the algorithm in (Feldman & Domshlak, 2012), where "estimation" was referred to as
"recommendation."

4. Upper Bounds on Simple Regret Reduction Rate with BRUE

For the sake of simplicity, in our formal analysis of BRUE we assume uniqueness of the
optimal policy $\pi^*$; that is, at each state $s$ and each number $h$ of steps-to-go, there is a
single optimal action, and it is $\pi^*(s, h)$. Let $T_n$ be the graph obtained by BRUE after $n$
iterations, and let $\hat{Q}_h(s, a)$ denote the accumulated value $\hat{Q}(s, a)$ for $s$ at depth $H - h$. For
all state/steps-to-go pairs $(s, h) \in T_n$, $\hat{\pi}_n(s, h)$ is a randomized strategy, uniformly choosing
among actions $a$ maximizing $\hat{Q}_h(s, a)$. We also use some additional auxiliary notation.

$K = \max_{s \in S} |A(s)|$, i.e., the maximal number of actions per state.

$p = \min_{s, a, s' : Tr(s, a, s') > 0} Tr(s, a, s')$, i.e., the likelihood of the least likely (but still
possible) outcome of an action in our problem.

$d = \min_{s,\, a \ne \pi^*(s, 1)} \Delta_1[s, a]$, i.e., the smallest difference between the value of the optimal and a
second-best action at a state with just one step-to-go.

Our key result on the BRUE algorithm is Theorem 1 below. The proof of Theorem 1, as 
well as of several required auxiliary claims, is given in Appendix A. Here we outline only 
the key issues addressed by the proof, and provide a high-level flow of the proof in terms of 
a few central auxiliary claims. 

Theorem 1 Let BRUE be called on a state $s_0$ of an MDP $(S, A, Tr, R)$ with rewards in
$[0, 1]$ and finite horizon $H$. There exist pairs of parameters $c, c' > 0$, dependent only on
$\{p, d, K, H\}$, such that, after $n > H$ iterations of BRUE, we have simple regret bounded as

$$E \Delta_H[s_0, \hat{\pi}_n(s_0, H)] \le H c \cdot e^{-c' n}, \tag{10}$$

and choice-error probability bounded as

$$P\{\hat{\pi}_n(s_0, H) \ne \pi^*(s_0, H)\} \le c \cdot e^{-c' n}. \tag{11}$$

In particular, these bounds hold for 

C ~ ( pH 2 -4:H+2p3H 2 -3H ' ^ > 



and 



C 2H16 H - 1 (H\) 2 K 2H ' { ' 



Before we proceed any further, some discussion of the statements in Theorem 1 is in
order. First, the parameters $c$ and $c'$ in the bounds established by Theorem 1 are
problem-dependent: in addition to the dependence on the horizon $H$ and the choice branching factor
$K$ (which is unavoidable), the parameters $c$ and $c'$ also depend on the distribution
parameters $p$ and $d$. While it is possible that this dependence can be partly alleviated, Bubeck
et al. (2011) showed that distribution-free exponential bounds on the simple regret
reduction rate cannot be achieved even in MABs, that is, even in single-step-to-go MDPs (see
Remark 2 of Bubeck et al. (2011), which is based on a lower bound on the cumulative
regret established by Auer, Cesa-Bianchi, Freund, & Schapire, 2002b). Second, the specific
parameters $c$ and $c'$ provided by Eqs. 12 and 13 are worst-case for MDPs with parameters
$d$, $p$, and $K$, and the bound in Eq. 10 becomes effective after

$$n > \frac{\ln(c)}{c'} = O\left(\left(\frac{KH}{pd}\right)^{\varepsilon H^2}\right)$$

iterations, for some small constant $\varepsilon > 1$. While there is still some gap between this transition
period length and the transition period length of the theoretical CraftyUniform algorithm
(see Eq. 8), this gap is not that large.²

The proof of Lemma 2 below constitutes the crux of the proof of Theorem 1. Once
we have proven this lemma, the proof of Theorem 1 stems from it in a more-or-less direct
manner.


Lemma 2 Let BRUE be called on a state $s_0$ of an MDP $(S, A, Tr, R)$ with rewards in $[0, 1]$
and finite horizon $H$. For each $h \in \{1, \dots, H\}$, there exist parameters $c_h, c'_h > 0$, dependent only
on $\{p, d, K, H\}$, such that, for each state $s$ reachable from $s_0$ in $H - h$ steps and any $t > 0$,
it holds that

$$\begin{aligned}
P\left\{\hat{Q}_h(s, a) - Q_h(s, a) \ge \frac{d}{2} \,\middle|\, n_h(s, a) = t\right\} &\le c_h e^{-c'_h t}, \\
P\left\{\hat{Q}_h(s, a) - Q_h(s, a) \le -\frac{d}{2} \,\middle|\, n_h(s, a) = t\right\} &\le c_h e^{-c'_h t}.
\end{aligned} \tag{14}$$



In particular, these bounds hold for 

K 2Hh+h*-2H-l (W) 3 ( i! )4 2 4^-ll 6 (^-l) 2 



Ch 



and 



^20-l) 2 . p 2Hh+h 2 -2H-h 
- 16 h-lf h \}2 K H+h-l- 



(15) 



(16) 



The proof for Lemma 2 is by induction on $h$. Starting with the induction basis for $h = 1$,
it is easy to verify that, by the Chernoff-Hoeffding inequality,

$$P\left\{\left|\hat{Q}_1(s, a) - Q_1(s, a)\right| \ge \frac{d}{2} \,\middle|\, n(s, a) = t\right\} \le 2 e^{-\frac{d^2 t}{2}}, \tag{17}$$

that is, the assertion is satisfied with $c_1 = 1$ and $c'_1 = \frac{d^2}{2}$. Now, assuming the claim holds
for $h \ge 1$, below we outline the proof for $h + 1$, relegating the actual proof in full detail to
Appendix A.

In the proof for h > 1, it is crucial to note the invalidity of applying the Chernoff- 
Hoeffding bound directly, as was done in Eq. 17. There are two reasons for this. 



2. Some of this gap can probably be eliminated by more accurate bounding in the numerous bounding steps 
towards the proof of Theorem 1. However, all such improvements we tried made the already lengthy 
proof of Theorem 1 even more involved. 

(F1) For $h = 1$, $\hat{Q}$ is an unbiased estimator of $Q$, that is, $E\hat{Q} = Q$. In contrast, the
estimates inside the tree (at nodes with $h > 1$) are biased. This bias stems from $\hat{Q}$
possibly being based on numerous sub-optimal choices in the sub-tree rooted in $(s, h)$.

(F2) For $h = 1$, the summands accumulated by $\hat{Q}$ are independent. This is not so for $h > 1$,
where the accumulated reward depends on the selection of actions in subsequent nodes,
which in turn depends on previous rewards.

However, we show that these deficiencies of $h > 1$ can still be overcome through a novel
modification of the seminal Hoeffding-Azuma inequality.

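Lemma 3 (Modified Hoeffding-Azuma inequality) Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of random
variables with support $[0, h]$ and $\mu_i = \mathbb{E}X_i$. If $\lim_{i\to\infty} \mu_i = \mu$, and

$$
P\left\{\mathbb{E}\left[X_t \mid X_1, \dots, X_{t-1}\right] \ne \mu\right\} \le c_p e^{-c_e t}
\tag{18}
$$

for some $0 < c_p$ and $0 < c_e \le 1$, then, for all $0 < \delta \le \frac{h}{2}$, it holds that

$$
P\left\{\sum_{i=1}^{t} X_i \ge \mu t + t\delta\right\} \le \left(1 + \frac{2h^2 c_p}{\delta^2 c_e^2}\right) e^{-\frac{3\delta^2 c_e}{2h^2}t},
\tag{19}
$$

$$
P\left\{\sum_{i=1}^{t} X_i \le \mu t - t\delta\right\} \le \left(1 + \frac{2h^2 c_p}{\delta^2 c_e^2}\right) e^{-\frac{3\delta^2 c_e}{2h^2}t}.
\tag{20}
$$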

Together with Lemma 4 below, the inequalities provided by Lemma 3 allow us to prove
the induction hypothesis in the proof of the central Lemma 2. Note that the specific bound
in Lemma 3 is selected so as to maximize the exponent coefficient. For any $0 < \beta < 1$, the
probabilities of interest in Eqs. 19-20 can also be bounded by



1 + 



c e (l-/3) 



Ce (1 " P) 

for further details, we refer the reader to Discussion 14 in Appendix A. 



Definition 1 Let $(S, A, Tr, R)$ be an MDP with rewards in $[0,1]$, planned for initial state
$s_0 \in S$ and finite horizon $H$. Let $s$ be a state reachable from $s_0$ with $h$ steps still to go, let
$a$ be an action applicable in $s$, and let $\pi^B_t$ be a policy induced by running BRUE on $s_0$ until
exactly $t > 0$ samples have finished their exploration phase with applying action $a$ at $s$ with
$h - 1$ steps still to go. Given that,

• $X_{t,h}(s,a)$ is a random variable, corresponding to the reward obtained by taking $a$ at
$s$, and then following $\pi^B_t$ for the remaining $h - 1$ steps.

• $E_{t,h}(s,a)$ is the event in which $X_{t,h}(s,a)$ is sampled along the optimal actions at each
of the $h - 1$ choice points delegated to $\pi^B_t$.

• $\delta_{t,h}(s,a) = Q_h(s,a) - \mathbb{E}\left[X_{t,h}(s,a)\right]$.
Lemma 4 Let $(S, A, Tr, R)$ be an MDP with rewards in $[0,1]$, planned for initial state
$s_0 \in S$ and finite horizon $H$. Let $s$ be a state reachable from $s_0$ with $h + 1$ steps still
to go, and $a$ be an action applicable in $s$. Considering $E_{t,h+1}(s,a)$ and $\delta_{t,h+1}(s,a)$ as in
Definition 1, for any $t > 0$, if Lemma 2 holds for horizon $h$, then

$$
P\left\{\neg E_{t,h+1}(s,a)\right\} \le 2Kh(2 + c_h)\,e^{-\frac{p c'_h}{6K}t},
\tag{21}
$$

$$
\delta_{t,h+1}(s,a) \le 2Kh^2(2 + c_h)\,e^{-\frac{p c'_h}{6K}t}.
\tag{22}
$$

Together with the modified version of the Hoeffding-Azuma bound in Lemma 3, the bounds
established in Lemma 4 allow us to derive concentration bounds for $\hat{Q}_{h+1}$ around $Q_{h+1}$ as
in Lemma 5 below, which serves as the key building block for proving the induction hypothesis
in the proof of Lemma 2.

Lemma 5 Let BRUE be called on a state $s_0$ of an MDP $(S, A, Tr, R)$ with rewards in $[0, 1]$
and finite horizon $H$. For each state $s$ reachable from $s_0$ with $h + 1$ steps still to go, each action
$a$ applicable in $s$, and any $t > 0$, it holds that

$$
P\left\{\hat{Q}_{h+1}(s,a) - Q_{h+1}(s,a) \ge \tfrac{d}{2} \,\middle|\, n_{h+1}(s,a) = t\right\}
\le \left(3456 \cdot \frac{K^3 (h+1)^3 c_h}{d^2 p^2 c'^2_h}\right) e^{-\frac{d^2 p c'_h}{16(h+1)^2 K}t}.
\tag{23}
$$



5. Learning With Forgetting and BRUE(α)

When we consider the evolution of action value estimates in BRUE over time (as well 
as in all other Monte-Carlo algorithms for online MDP planning), we can see that, in 
internal nodes these estimates are based on biased samples that stem from the selection 
of non-optimal actions at descendant nodes. This bias tends to shrink as more samples 
are accumulated down the tree. Consequently, the estimates become more accurate, the 
probability of selecting an optimal action increases accordingly, and the bias of ancestor 
nodes shrinks in turn. An interesting question in this context is: shouldn't we weigh 
differently samples obtained at different stages of the sampling process? Intuition tells 
us that biased samples still provide us with valuable information, especially when they 
are all we have, but the value of this information decreases as we obtain more and more 
accurate samples. Hence, in principle, putting more weight on samples with smaller bias 
could increase the accuracy of our estimates. The key question, of course, is which of all 
possible weighting schemes are both reasonable to employ and preserve the exponential-rate 
reduction of expected simple regret. 

Here we describe BRUE(α), an algorithm that generalizes BRUE = BRUE(1) by basing
the estimates only on the α fraction of most recent samples. We discuss the value of this
addition both from the perspective of the formal guarantees and from the perspective
of empirical prospects. BRUE(α) differs from BRUE in two points:

• In addition to the variables $n(s, a)$ and $\hat{Q}(s, a)$, each node/action pair $(s, a)$ in BRUE(α)
is associated with a list $\mathcal{L}(s, a)$ of rewards, collected at each of the $n(s, a)$ samples that
are responsible for the current estimate $\hat{Q}(s,a)$.

• When a sample $\rho = \langle s_0, a_1, s_1, \dots, a_k, s_k \rangle$ is issued at iteration $n$, and update-statistics
updates the variables at $x = (s_{\sigma(n)-1}, a_{\sigma(n)})$, that update is done not according to
Eq. 2 as in BRUE, but according to:

$$
\begin{aligned}
n(x) &\leftarrow n(x) + 1,\\
\mathcal{L}(x)[n(x)] &\leftarrow \sum_{i=\sigma(n)-1}^{k-1} R(s_i, a_{i+1}, s_{i+1}),\\
\hat{Q}(x) &\leftarrow \frac{1}{\lceil \alpha \cdot n(x) \rceil} \sum_{i=n(x)-\lceil \alpha \cdot n(x) \rceil}^{n(x)} \mathcal{L}(x)[i].
\end{aligned}
\tag{24}
$$
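
The update of BRUE(α) described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ForgettingEstimate` and its fields are hypothetical, and the estimate is taken as the plain average of exactly the ⌈α·n⌉ most recent accumulated rewards, which is how the text describes the scheme ("the α fraction of most recent samples").

```python
import math
from typing import List


class ForgettingEstimate:
    """Hypothetical per-(state, action) statistics for a BRUE(alpha)-style update."""

    def __init__(self, alpha: float) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must lie in (0, 1]")
        self.alpha = alpha
        self.rewards: List[float] = []  # plays the role of the list L(x)
        self.q_hat = 0.0                # current estimate of Q(x)

    @property
    def n(self) -> int:
        """Number of samples recorded so far, n(x)."""
        return len(self.rewards)

    def update(self, accumulated_reward: float) -> float:
        """Record one accumulated reward and re-estimate Q from the most
        recent ceil(alpha * n) entries only."""
        self.rewards.append(accumulated_reward)
        window = math.ceil(self.alpha * self.n)  # at least 1, since alpha > 0
        recent = self.rewards[-window:]
        self.q_hat = sum(recent) / window
        return self.q_hat


if __name__ == "__main__":
    est = ForgettingEstimate(alpha=0.5)
    for r in [1.0, 2.0, 3.0, 4.0]:
        print(est.n + 1, est.update(r))
```

With α = 1 the window covers every stored reward and the estimate reduces to the ordinary running average used by BRUE. Storing the full list costs memory linear in n(x), which is the price of being able to drop the oldest samples.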



Theorem 6 Let BRUE(α) be called on a state $s_0$ of an MDP $(S, A, Tr, R)$ with rewards
in $[0, 1]$ and finite horizon $H$. There exist pairs of parameters $c, c' > 0$, dependent only on
$\{\alpha, p, d, K, H\}$, such that, after $n > H$ iterations of BRUE(α), we have simple regret bounded
as

$$
\mathbb{E}\Delta_H\left[s_0, \pi^B_n(s_0, H)\right] \le H c \cdot e^{-c' n},
\tag{25}
$$

and choice-error probability bounded as

$$
P\left\{\pi^B_n(s_0, H) \ne \pi^*(s_0, H)\right\} \le c \cdot e^{-c' n}.
\tag{26}
$$

The proof of Theorem 6 follows from Lemma 7 below similarly to the way Theorem 1
follows from Lemma 2. Note that in Theorem 6 we do not provide explicit expressions for
the constants $c$ and $c'$ as we did in Theorem 1 (for α = 1). This is because the expressions
that can be extracted from the recursive formulas in this case do not bring much insight.
However, we discuss the potential benefits of choosing α < 1 in the context of our proof of
Theorem 6.



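Lemma 7 Let BRUE(α) be called on a state $s_0$ of an MDP $(S, A, Tr, R)$ with rewards in
$[0, 1]$ and finite horizon $H$. For each $h \in [H]$, there exist parameters $c_h, c'_h > 0$, dependent
only on $\{\alpha, p, d, K, H\}$, such that, for each state $s$ reachable from $s_0$ in $H - h$ steps and any
$t > 0$, it holds that

$$
\begin{aligned}
P\left\{\hat{Q}_h(s,a) - Q_h(s,a) \ge \tfrac{d}{2} \,\middle|\, n_h(s,a) = t\right\} &\le c_h e^{-c'_h t},\\
P\left\{\hat{Q}_h(s,a) - Q_h(s,a) \le -\tfrac{d}{2} \,\middle|\, n_h(s,a) = t\right\} &\le c_h e^{-c'_h t}.
\end{aligned}
\tag{27}
$$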



The proof of Lemma 7 is by induction, following the same line as the proof of Lemma 2.
In fact, it deviates from the latter only in the application of the modified Hoeffding-Azuma
inequality, which has to be further modified to capture the partial sums as in BRUE(α).

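Lemma 8 (Modified Hoeffding-Azuma inequality for partial sums) Let $\{X_i\}_{i=1}^{\infty}$ be
a sequence of random variables with support $[0, h]$ and $\mu_i = \mathbb{E}X_i$. If $\lim_{i\to\infty} \mu_i = \mu$, and

$$
P\left\{\mathbb{E}\left[X_i \mid X_1, \dots, X_{i-1}\right] \ne \mu\right\} \le c_p e^{-c_e i}
\tag{28}
$$

for some $0 < c_p$ and $0 < c_e \le 1$, then, for all $0 < \delta \le \frac{h}{2}$, it holds that

$$
P\left\{\sum_{i=t-\lceil \alpha t \rceil}^{t} X_i \ge \mu \lceil \alpha t \rceil + \lceil \alpha t \rceil \delta\right\}
\le \left(1 + \frac{c_p}{c_e(1-\alpha)}\, e^{-c_e(1-\alpha)^2 t}\right) e^{-\frac{3\delta^2 c_e}{2h^2}\lceil \alpha t \rceil},
\tag{29}
$$

$$
P\left\{\sum_{i=t-\lceil \alpha t \rceil}^{t} X_i \le \mu \lceil \alpha t \rceil - \lceil \alpha t \rceil \delta\right\}
\le \left(1 + \frac{c_p}{c_e(1-\alpha)}\, e^{-c_e(1-\alpha)^2 t}\right) e^{-\frac{3\delta^2 c_e}{2h^2}\lceil \alpha t \rceil}.
\tag{30}
$$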


Considering the benefits of "sample forgetting" as in BRUE(α), let us compare the bound
in Lemma 8 to the bound



1 + 



(1-/3) 



provided by Lemma 3 for BRUE, that is, when all accumulated samples are averaged. While
both bounds are very similar, the exponent of the second exponential term is multiplied for
BRUE(α < 1) by $(1 - \alpha)t$. This poses a tradeoff: Decreasing α reduces the sampling bias,
and thus decreases the term that carries the leading constant, but increases the other
exponential term with no leading constant. Obviously, since there is no bias at leaf nodes,
it makes no sense to set α < 1 there. However, as we go further up the tree, the bias tends to
grow (the leading constant becomes $\gg 1$), but we also expect to have more samples ($t$ is larger).
Thus, from the perspective of formal guarantees, it seems appealing to choose smaller values
of α higher up in the tree. Nevertheless, we do not try to optimize the value of α here. First,
optimizing bounds does not necessarily lead to optimized empirical accuracy. Second, the
underlying optimization would have to be specific to each horizon $h$ and each sample size $t$
(which is obviously out of the question), and thus we would anyway have to consider only
some rough approximations to this optimization problem. Finally, biased samples in practice
might be more valuable than what the theory suggests, as long as all actions at the same
state/steps-to-go decision point experience a similar bias.



6. Experimental Evaluation 

We have evaluated BRUE empirically on the MDP sailing domain (Peret & Garcia, 2004) 
that was used in previous works for evaluating MC planning algorithms (Peret & Garcia, 
2004; Kocsis & Szepesvari, 2006; Tolpin & Shimony, 2012), as well as on random game trees
used in the original empirical evaluation of UCT (Kocsis & Szepesvari, 2006). 

In the sailing domain, a sailboat navigates to a destination on an 8-connected grid 
representing a marine environment, under fluctuating wind conditions. The goal is to reach 
the destination as quickly as possible, by choosing at each grid location a neighbor location 
to move to. The duration of each such move depends on the direction of the move (ceteris
paribus, diagonal moves take √2 times as long as straight moves), the direction of the wind
relative to the sailing direction (the sailboat cannot sail against the wind and moves fastest 
with a tail wind), and the tack. The direction of the wind changes over time, but its strength 
is assumed to be fixed. This sailing problem can be formulated as a goal-driven MDP over 
finite state space and a finite set of actions, with each state capturing the position of the 
sailboat, wind direction, and tack. 
Figure 4: Empirical performance of BRUE, BRUE(0.9), UCT, and ε-greedy + UCT (denoted
as GCT, for short) in terms of the average error on sailing domain problems on
n × n grids with n ∈ {5, 10, 20, 40}.



In a goal-driven MDP, the lengths of the paths to a terminal state are not necessarily
bounded, and thus it is not entirely clear to what depth BRUE should construct its tree.
In the sailing domain, we chose H to be 4 × n, where n is the grid size of the problem
instance, as it is unlikely that the optimal path between any two locations on the grid will
be longer than a complete encircling of the considered area. We note, however, that the
recommendation-oriented samples ρ always end at a terminal state, similar to the rollouts
issued by UCT and ε-greedy + UCT.

Snapshots of the results for different grid sizes are shown in Figure 4. We compared
BRUE with two MCTS-based algorithms: the UCT algorithm, and a recent modification of
UCT, ε-greedy + UCT, obtained from the former by replacing the UCB1 policy at the root
node with the ε-greedy policy (Tolpin & Shimony, 2012). The motivation behind the design
of ε-greedy + UCT was to improve the empirical simple regret of UCT, and the results for
ε-greedy + UCT reported by Tolpin and Shimony (2012) (and confirmed by our experiments
here) are very impressive. We also show the results for BRUE_per(0.9), a slight modification
of BRUE(0.9) with a more permissive update scheme: Instead of updating only the
state-action node at the level of the switching point, we also update any ancestor for which
either not all applicable actions have been sampled or the chosen action was identical to the
best empirical one.

Figure 5: Empirical performance of BRUE, UCT, and ε-greedy + UCT (denoted as GCT) in
terms of the average error on the random game trees with branching factor B
and tree depth D. Panels: B = 6/D = 6 and B = 2/D = 16.

All four algorithms were implemented within a single software infrastructure. As suggested
by more recent works on UCT, the exploration coefficient for UCT and ε-greedy + UCT
(parameter c in Eq. 1) was set to the empirical best value of an action at the decision
point (Keller & Eyerich, 2012b). (This setting of the exploration coefficient resulted in better
performance of both UCT and ε-greedy + UCT than with the settings reported on the
sailing domain in the respective original publications.) The ε parameter in ε-greedy + UCT
was set to 0.5 as in the experiments of Tolpin and Shimony (2012). Each algorithm was run
on 1000 randomly chosen initial states $s_0$, and the performance of the algorithm was assessed
in terms of the average error $Q(s_0, a) - V(s_0)$, that is, the difference between the
true values of the action $a$ chosen by the algorithm and that of the optimal action $\pi^*(s_0)$.
Consistently with the results reported by Tolpin and Shimony (2012), on the smaller tasks
ε-greedy + UCT outperformed UCT by a very large margin, with the latter exhibiting very
little improvement over time even on the smallest, 5 × 5, grids. The difference between
ε-greedy + UCT and UCT on the larger tasks was less notable. In turn, BRUE substantially
outperformed ε-greedy + UCT, with the improvement being consistent except for relatively
short planning deadlines, and BRUE_per(0.9) performed even better than BRUE.

The above allows us to conclude that BRUE is not only attractive in terms of the
formal performance guarantees, but can also be very effective in practice for online planning.
Likewise, the "learning with forgetting" extension of BRUE(α) also has its practical merits.
Under the same parameter setting of UCT and ε-greedy + UCT, we have also evaluated the
three algorithms in a domain of random game trees whose goal is a simple modeling of
two-person zero-sum games such as Go, Amazons and Clobber. In such games, the winner
is decided by a global evaluation of the end board, with the evaluation employing this or
another feature counting procedure; the rewards thus are associated only with the terminal
states. The rewards are calculated by first assigning values to moves, and then summing up
these values along the paths to the terminal states. Note that the move values are used for
the tree construction only and are not made available to the players. The values are chosen
uniformly from [0, 127] for the moves of MAX, and from [−127, 0] for the moves of MIN.
The players act so as to (depending on the role) maximize/minimize their individual payoff:
the aim of MAX is to reach a terminal state s with as high R(s) as possible, and the objective of
MIN is similar, mutatis mutandis. This simple game tree model is similar in spirit to many
other game tree models used in previous work (Kocsis & Szepesvari, 2006; Smith & Nau,
1994), except that the success/failure of the players is measured not on a ternary scale of
win/lose/draw, but via the actual payoffs they receive. We ran experiments with two
different settings of the branching factor (B) and tree depth (D). As in the sailing domain,
we compared the convergence rate obtained by BRUE, UCT and ε-greedy + UCT. Figure 5
plots the average error rate for two configurations, B = 6, D = 6 and B = 2, D = 16, with
the average in each setting obtained over 500 trees. The results here appear encouraging as
well, with BRUE overtaking the other two algorithms more quickly on the deeper trees.
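
The random game tree construction described above can be sketched as follows. This is an illustrative reading of the description rather than the benchmark code used in the experiments: move values are assumed to be integers drawn uniformly from the stated ranges, MAX is assumed to move first, and all names (`Node`, `build_tree`, `minimax`, `count_leaves`) are hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    reward: Optional[int] = None  # set only at terminal states
    children: List["Node"] = field(default_factory=list)


def build_tree(branching: int, depth: int, rng: random.Random,
               max_to_move: bool = True, path_sum: int = 0) -> Node:
    """Build a uniform tree; a terminal reward is the sum of the hidden
    move values along the path that leads to it."""
    if depth == 0:
        return Node(reward=path_sum)
    node = Node()
    for _ in range(branching):
        # MAX moves are worth [0, 127], MIN moves are worth [-127, 0].
        value = rng.randint(0, 127) if max_to_move else rng.randint(-127, 0)
        node.children.append(
            build_tree(branching, depth - 1, rng, not max_to_move, path_sum + value))
    return node


def minimax(node: Node, max_to_move: bool = True) -> int:
    """Exact game value, usable as ground truth when measuring error."""
    if not node.children:
        return node.reward
    values = [minimax(child, not max_to_move) for child in node.children]
    return max(values) if max_to_move else min(values)


def count_leaves(node: Node) -> int:
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


if __name__ == "__main__":
    tree = build_tree(branching=2, depth=4, rng=random.Random(0))
    print(count_leaves(tree), minimax(tree))
```

Because only terminal rewards are exposed through `Node.reward`, a planner run on such a tree sees the payoffs but not the per-move values, matching the remark that move values are used for construction only.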

7. Summary

We have introduced BRUE, a simple Monte-Carlo algorithm for online planning in MDPs 
that guarantees exponential-rate reduction of the performance measures of interest, namely 
the simple regret and the probability of erroneous action choice. This improves over previous 
algorithms such as UCT, which guarantee only polynomial-rate reduction of these measures. 
The algorithm has been formalized for finite horizon MDPs, and it was analyzed as such. 
However, our empirical evaluation shows that it also performs well on goal-driven MDPs 
and two-person games. 

A few questions remain for future work. In the setting of γ-discounted MDPs with
infinite horizons, a straightforward way to employ BRUE is to fix a horizon H, use the
algorithm as is, and derive guarantees on the aforementioned measures of interest by simply
accounting for the additive gap of $\gamma^H R_{\max}/(1 - \gamma)$ between the state/action values
under horizon H and those under an infinite horizon. However, this is not necessarily the
best way to plan online for infinite-horizon MDPs, and thus this setting requires further
inspection. Second, it is not unlikely that the state-space independent factors $c_h$ and $c'_h$
in the guarantees of BRUE can be improved by employing more sophisticated combinations
of exploration and estimation samples. Another important point to consider is the speed
of convergence to the optimal action, as opposed to the speed of convergence to "good"
actions. BRUE is geared towards identifying the optimal action, although in many large
MDPs, "good" is often the best one can hope for. To identify the optimal solution, BRUE
devotes samples equally to all depths. However, focusing on nodes closer to the root node
may improve the quality of the recommendation if the planning time is severely limited.
Finally, the core tree sampling scheme employed by BRUE differs from the more standard
scheme employed in previous work. While this difference plays a critical role in establishing
the formal guarantees of BRUE, it is still unclear whether that difference is necessary for
establishing exponential-over-time reduction of the performance measures.
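
For the discounted setting mentioned above, the choice of a truncation horizon can be made concrete. The sketch below assumes the additive gap has the standard form γ^H · R_max / (1 − γ); the function names are hypothetical, and the snippet only illustrates how a horizon H might be picked for a target gap, not a procedure from the paper.

```python
def truncation_gap(gamma: float, horizon: int, r_max: float = 1.0) -> float:
    """Additive gap gamma^H * R_max / (1 - gamma) between values under
    horizon H and under an infinite horizon (assumed standard form)."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    return gamma ** horizon * r_max / (1.0 - gamma)


def min_horizon(gamma: float, eps: float, r_max: float = 1.0) -> int:
    """Smallest horizon H whose truncation gap is at most eps."""
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    horizon = 0
    while truncation_gap(gamma, horizon, r_max) > eps:
        horizon += 1
    return horizon


if __name__ == "__main__":
    print(min_horizon(gamma=0.95, eps=0.01))
```

Since the constants in the guarantees of BRUE deteriorate quickly with H, the smallest horizon that meets the target gap is the natural choice under this scheme.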
Appendices



Appendix A. Proof of Theorem 1 

The proof of Theorem 1 relies on the inductive assumption with respect to the correctness 
of Lemma 2, as well as on several auxiliary claims that we prove in what follows. The 
dependence diagram below depicts the overall flow of the proof, with the more central 
claims being depicted with rectangular nodes. 




Proposition 9 (Concentration inequality for negative-binomial distributions) Let
$NB(t,p)$ be a random variable with negative-binomial distribution. Then

$$
P\left\{NB(t,p) \le \frac{3t}{4p}\right\} \le e^{-\frac{tp}{6}}.
\tag{31}
$$

Proof: It is well known that the event in which the number of Bernoulli trials required to
obtain the $t$-th success is at most some positive integer $b$ is equivalent to the event that
the number of successes in $b$ Bernoulli trials is at least $t$. Therefore, for any $0 < \delta < 1$,

$$
\begin{aligned}
P\left\{NB(t,p) \le \frac{\delta t}{p}\right\}
&= P\left\{Bin\left(\frac{\delta t}{p}, p\right) \ge t\right\}\\
&= P\left\{Bin\left(\frac{\delta t}{p}, p\right) \ge t\delta + (t - t\delta)\right\}\\
&\le e^{-\frac{2(t - t\delta)^2 p}{t\delta}}
\end{aligned}
\tag{32}
$$

by the Hoeffding inequality,
and choosing $\delta = \frac{3}{4}$ yields the result. ■
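
The equivalence between the negative-binomial and binomial events used in the proof above can be checked exactly for small parameters. The sketch below is only an illustration: it assumes that NB(t, p) counts the number of trials up to and including the t-th success, and the function names are hypothetical.

```python
import math


def prob_trials_at_most(t: int, p: float, b: int) -> float:
    """P{the t-th success occurs within the first b Bernoulli(p) trials},
    summed over the trial index n at which the t-th success lands."""
    return sum(math.comb(n - 1, t - 1) * p ** t * (1.0 - p) ** (n - t)
               for n in range(t, b + 1))


def prob_binomial_at_least(b: int, p: float, t: int) -> float:
    """P{Bin(b, p) >= t}."""
    return sum(math.comb(b, k) * p ** k * (1.0 - p) ** (b - k)
               for k in range(t, b + 1))


if __name__ == "__main__":
    t, p = 8, 0.5
    b = int(3 * t / (4 * p))  # threshold 3t / (4p) from Proposition 9
    lhs = prob_trials_at_most(t, p, b)
    print(lhs, prob_binomial_at_least(b, p, t), math.exp(-t * p / 6.0))
```

For t = 8 and p = 0.5 the threshold is b = 12, and both probabilities coincide and stay below the bound exp(−tp/6) stated in the proposition.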

Proposition 10 (Number of Child Samples Bound) Let BRUE be called on a state
$s_0$ of an MDP $(S, A, Tr, R)$ with rewards in $[0, 1]$ and finite horizon $H$. Let $(s, h)$ be a node
reachable from $(s_0, H)$, and in turn, $(s', h')$ be a node reachable from $(s, h)$ via an action
sequence that starts with applying action $a$ at $s$. Then, for any $a' \in A(s')$, we have

$$
P\left\{n_{h'}(s',a') \le \frac{t}{4} \cdot \frac{p_{h'}(s',a')}{p_h(s,a)} \,\middle|\, n_h(s,a) = t\right\}
\le 2e^{-\frac{t\, p_{h'}(s',a')^2}{6\, p_h(s,a)}},
\tag{33}
$$

where $p_h(s,a)$ is the probability that an $(H - h)$-iteration of BRUE will issue a sample
whose exploration phase ends with applying action $a$ at state $s$ with $h$ steps still to go.
Proof: By the choice of the switching point function of BRUE as in Eq. 9, the number
of samples of action $a'$ in the descendant node $(s', h')$ between two consecutive samples of
action $a$ in node $(s, h)$ is distributed according to

$$
\sum_{i=1}^{1+\gamma} \beta_i,
\tag{34}
$$

where $\gamma \sim Geo(p_h(s,a))$ and $\beta_i \sim Ber(p_{h'}(s',a'))$ are all independent random variables.
Indeed, for every pair of consecutive iterations $n < n'$ with $\sigma(n) = \sigma(n') = H - h$,

(i) there is exactly one iteration $n < n'' < n'$ with $\sigma(n'') = H - h'$, and

(ii) the number of $(H - h)$-iterations between two consecutive $(H - h)$-iterations that finish
their exploration phase with applying action $a$ at $s$ is geometric.

Putting (i) and (ii) together, the number of $(H - h')$-iterations between a pair of consecutive
$(H - h)$-iterations that finish their exploration phase with applying action $a$ at $s$ is also
geometric. In turn, the probability that an $(H - h')$-iteration will finish its exploration
phase with applying $a'$ at $s'$ is $p_{h'}(s',a')$, and thus the number of $(H - h')$-iterations that
finish their exploration phase with applying $a'$ at $s'$ between a pair of consecutive
$(H - h)$-iterations that finish their exploration phase with applying action $a$ at $s$ is distributed as
in Eq. 34.

Similarly, it can be shown that the (conditioned) random variable

$$
n_{h'}(s',a') \mid n_h(s,a) = t
$$

is distributed according to

$$
\sum_{i=1}^{t+\gamma_t} \beta_i,
\tag{35}
$$

where $\gamma_t \sim NB(t, p_h(s,a))$, $\beta_i \sim Ber(p_{h'}(s',a'))$, and all $\gamma_t$ and $\beta_i$ are independent.

Therefore, denoting Ph(s, a) and Ph>(s' , a') by ph and py, respectively, for short, we have 



n h > {s, a) < -*—Ph> 
4ph 



n h < (s,a) = t 



Eq.35 




r 3^ * \ t+x t \ 

*=^ +1 U=1 J (36) 

lt<^+ £ p|Bin((t + x),^)< i^^'} P ^ t = a; > 



since ft are all independent Bernoulli variables with common parameter p h i 



lt<^ p -\+ Yl V{Bin({t + x),p h ,)<{t + x) Ph ,-6 x }F{'yt = x} 



4 Ph 
where 6* = 4xph -^ 4ph) Ph > . 



Given that, for all x > we have 



P{Bm((t + x),p h .) < (t + x) Ph ,-5 x } < e («+-) 

by the Hoeffding inequality, applicable here since o x 

s 2 tpi, 

since — — > t H — 

t + x 4p h 

< e 2 Ph P h'. 
Plugging Eq. 37 into Eq. 36, we obtain 



4xp h - t(l - 4p h ) t(2 + 4p fe ) 

-Ph' > : Ph' > 



4ph 



4p h 



(37) 



n g (s, a) < - — p h > 



3t 



n g (s, a) 



Phi* 



t 4Ph > x= 3^ +1 

Prop. 9 tp h p h' 

< e~~ + 2^ e 2 PhP{ 7t = x } 

4 f>h 

< e e + e 2 Ph 



< 2e~ 



h' 



(38) 



Proposition 11 Let BRUE be called on a state $s_0$ of an MDP $(S, A, Tr, R)$ with rewards
in $[0, 1]$ and finite horizon $H$. Let $(s, h + 1)$ be a node reachable from $(s_0, H)$, and in turn,
$(s', h')$ be a node reachable from $(s, h + 1)$. If Lemma 2 holds for horizon $h$, then, for any
$a \in A(s)$, $a' \in A(s')$, and $t \ge 1$,

$$
P\left\{\hat{Q}_{h'}(s',a') - Q_{h'}(s',a') \ge \tfrac{d}{2} \,\middle|\, n_{h+1}(s,a) = t\right\}
\le (2 + c_{h'})\, e^{-\frac{t c'_{h'} p^{h+1-h'}}{6K^{h+1-h'}}},
\tag{39}
$$

and

$$
P\left\{\hat{Q}_{h'}(s',a') - Q_{h'}(s',a') \le -\tfrac{d}{2} \,\middle|\, n_{h+1}(s,a) = t\right\}
\le (2 + c_{h'})\, e^{-\frac{t c'_{h'} p^{h+1-h'}}{6K^{h+1-h'}}}.
\tag{40}
$$

Proof: The proofs of the two bounds are identical, and thus we explicitly prove
here only Eq. 39. In what follows, we use $p_{h+1}(s,a)$ and $p_{h'}(s',a')$ as defined in Proposition 10,
and here as well denote them by $p_{h+1}$ and $p_{h'}$, respectively, for short. Similarly, by $\hat{Q}_{h'}$, $Q_{h'}$,
$n_{h'}$, and $n_{h+1}$ we refer to $\hat{Q}_{h'}(s',a')$, $Q_{h'}(s',a')$, $n_{h'}(s',a')$, and $n_{h+1}(s,a)$, respectively, for short.
I Qh> - Qh> > 



2 



4^^ 



+1 



nh+i = t 



Prop. 10 _ 1 h> f ^ A ] 

< 2e 6 ^+i + P j Q h , - Q h , > - n h , =r\F{n h , =r\n h+1 = t} 

tv, y ~ ' 



:.A./Eq.l4 



~ 4 P/1+1 

oo 



(41) 



< ' 2e 6p ft+i + c h ,e- TC yF{n h , = r \ n h+l = t} 



T= tp h' 



< 2e 6ph + 1 + eye h ' 4p h+i 

iider the 
implies that 



Consider the fraction and recall that (s',h') is a descendant of (s,h). The latter 



Ph> > Ph+i 



p \h+l-ti 



(I) 



and thus > ^ • Continuing now Eq. 41, by Eq. 16, 

3d 2(h'-i) p H+h'-i H+h'-l , p x //-ft' 



Cft ' 32 h '- 1 ((/i')!) 2 ^^ 1 ^ 



and Eq. 41 under c'^, < py implies 



tp h l 



2 e 6p h +i +Ch , e h'4 Ph+1 < (2 + c h /) e h ' 6 PM-i 



< TV 



(42) 



< (2 + c h /)e tc 'h' eK^+i-^' . 



Proposition 12 (Expected accumulated rewards) Let $(S, A, Tr, R)$ be an MDP, and
let $X$ be the accumulated reward of a sample

$$
\rho = \langle s, a, s_1, a_1, \dots, s_h, a_h, s_{h+1} \rangle,
$$

started with taking action $a \in A$ in state $s \in S$, and continued with additional $h$ steps, in
which actions are chosen according to some arbitrary (possibly randomized) policy $\pi$. Let

• $E_{\pi,h+1}(s,a)$ denote the event in which, after $a$, $\rho$ is sampled along the optimal actions,
that is, for $i \in [h]$, $a_i = \pi^*_{h+1-i}(s_i)$, and

• $\delta_{\pi,h+1}(s,a) = Q_{h+1}(s,a) - \mathbb{E}[X]$.

Then,

$$
P\left\{\neg E_{\pi,h+1}(s,a)\right\} \le \sum_{i=1}^{h} P\left\{\pi_{h+1-i}(s_i) \ne \pi^*_{h+1-i}(s_i)\right\},
\tag{43}
$$

$$
\delta_{\pi,h+1}(s,a) = \sum_{i=1}^{h} \mathbb{E}\left[\Delta_{h+1-i}\left[s_i, \pi_{h+1-i}(s_i)\right]\right].
\tag{44}
$$

Proof: The proof of Eq. 43 is straightforward by the union bound. To prove Eq. 44, we
note that for any state/steps-to-go pair $(s, h) \in S \times [H]$, we have

$$
\mathbb{E}_{\pi,s'}\left[R(s, \pi_h(s), s')\right] = \mathbb{E}_{\pi}\left[Q_h(s, \pi_h(s))\right] - \mathbb{E}_{\pi,s'}\left[Q_{h-1}(s', \pi^*(s', h-1))\right].
$$

Using that, we obtain a telescopic series that yields

$$
\begin{aligned}
\mathbb{E}[X] &= \mathbb{E}_{\pi,s_1:s_{h+1}}\left[R(s,a,s_1) + \sum_{i=1}^{h} R(s_i, \pi_{h+1-i}(s_i), s_{i+1})\right]\\
&= Q_{h+1}(s,a) - \mathbb{E}_{s_1}\left[Q_h(s_1, \pi^*(s_1, h))\right]\\
&\quad + \sum_{i=1}^{h}\left(\mathbb{E}_{\pi,s_1:s_i}\left[Q_{h+1-i}(s_i, \pi_{h+1-i}(s_i))\right] - \mathbb{E}_{\pi,s_1:s_{i+1}}\left[Q_{h-i}(s_{i+1}, \pi^*(s_{i+1}, h-i))\right]\right)\\
&= Q_{h+1}(s,a) - \sum_{i=1}^{h} \mathbb{E}_{\pi,s_1:s_i}\left[\Delta_{h+1-i}\left[s_i, \pi_{h+1-i}(s_i)\right]\right].
\end{aligned}
$$



Proof of Lemma 4: 

By Definition 1, the event $E_{t,h+1}(s,a)$ corresponds to a sample

$$
\rho = \langle s, a, s_1, a_1, \dots, s_h, a_h, s_{h+1} \rangle,
$$

obtained by taking action $a$ at state $s$, reachable from $s_0$ with $h + 1$ steps still to go, and
then following the policy $\pi^B_t$, induced by running BRUE on $s_0$ until exactly $t > 0$ samples
finish their exploration phase with applying action $a$ at $s$ with $h$ steps still to go. From
Proposition 12, denoting $\bar{i} = h + 1 - i$, we have
h 



P {^E tyh+1 (s, a)} < I? Ui / < (*) I "h+i (*, «) = *} 
i=l 

E v{Qi{*i>a')>Qi(si,*t(si)) 



h 



(s,a) = tj 



^E E [ P a) > 



i=l o'^!(si) 



nh+i 

nh+i (s, a) :- / > + 
d 



n>h+i (s,a) = t 



Prop. 11 



up.rr _ — ^ . / p 

< 222K(2 + a)e 



i=i 



(*) 

< 2K/i(2 + C/l )e- te '^. 



(45) 



The last inequality (*) in Eq. 45 holds because, by assuming Lemma 2 for horizon h, 
for i € |7i], it can be straightforwardly derived from Eqs. 15 and 16 that Cj > Cj_i and 

c i K c i-1- 
Similarly, 



St,h+i (a, a) < ^EAifc, 

^ p{Qi ( ai ,a') >Q i (s i , 7 r?(si)) 



i=i 

h 

i=l 
h 

i=l 

Prop. 11 



a'#w?(«i) 



a') > Qi (si,7Tj (si)) 



jift+i (s,a) = tj 



(46) 



3p.ll — + f V 



i=i 

< 2i^/i 2 (2 + c/,)e" tc, h« 



Fact 13 Let $Z$ be a random variable with support $[a,b]$ and $\mathbb{E}[Z] = 0$. Then, for any
$\lambda \in \mathbb{R}^+$,

$$
\mathbb{E}\left[\exp(\lambda Z)\right] \le \exp\left(\frac{(b-a)^2\lambda^2}{8}\right).
$$

This result is well known due to Hoeffding.


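Proof of Lemma 3 (Modified Hoeffding-Azuma inequality):

Let $E_t$ be the event that $\mathbb{E}[X_t \mid X_1, \dots, X_{t-1}] = \mu$, and let

$$
Y_t \triangleq X_t - \mu \mid X_1(\omega), \dots, X_{t-1}(\omega).
$$

The random variable $Y_t$ is bounded by $h - \mu \ge Y_t \ge -\mu$, and furthermore, for $\omega \in E_t$,
$\mathbb{E}Y_t = 0$. Therefore, using Fact 13, for all $\omega \in E_t$ and $\lambda \in \mathbb{R}^+$, it holds that

$$
\mathbb{E}\left[e^{\lambda(\mu - X_t)} \,\middle|\, X_1, \dots, X_{t-1}\right] \le e^{\frac{\lambda^2 h^2}{8}}.
\tag{47}
$$

Moreover,

$$
\begin{aligned}
\mathbb{E}\left[e^{\lambda\sum_{i=1}^{t}(\mu - X_i)}\right]
&= \mathbb{E}\left[\mathbb{E}\left[e^{\lambda\sum_{i=1}^{t}(\mu - X_i)} \,\middle|\, X_1, \dots, X_{t-1}\right]\right]\\
&= \mathbb{E}\left[e^{\lambda\sum_{i=1}^{t-1}(\mu - X_i)} \cdot \mathbb{E}\left[e^{\lambda(\mu - X_t)} \,\middle|\, X_1, \dots, X_{t-1}\right]\right]\\
&\le e^{\frac{\lambda^2 h^2}{8}}\, \mathbb{E}\left[e^{\lambda\sum_{i=1}^{t-1}(\mu - X_i)}\right] + P\{\neg E_t\}\, e^{\lambda t h}\\
&\overset{\text{Eq. 18}}{\le} e^{\frac{\lambda^2 h^2}{8}}\, \mathbb{E}\left[e^{\lambda\sum_{i=1}^{t-1}(\mu - X_i)}\right] + c_p e^{t(\lambda h - c_e)}\\
&\overset{(*)}{\le} e^{\frac{\lambda^2 h^2}{8}t} + c_p \sum_{\tau=1}^{t} e^{\frac{\lambda^2 h^2}{8}(t-\tau)}\, e^{\tau(\lambda h - c_e)},
\end{aligned}
\tag{48}
$$

where (*) holds by the auxiliary step in Eqs. 49-51 below.

Considering the recursion

$$
f(t) = \theta f(t-1) + g(t),
\tag{49}
$$

it is easy to verify that, for all $0 \le c < t$,

$$
f(t) = \theta^{t-c} f(c) + \sum_{\tau=c+1}^{t} \theta^{t-\tau} g(\tau).
\tag{50}
$$

Given that, the bound (*) in Eq. 48 is obtained by setting

$$
\theta = e^{\frac{\lambda^2 h^2}{8}}, \qquad
f(t) = \mathbb{E}\left[e^{\lambda\sum_{i=1}^{t}(\mu - X_i)}\right], \qquad
g(t) = c_p e^{t(\lambda h - c_e)}.
\tag{51}
$$

Now, by the Markov inequality, for any $\lambda > 0$,

$$
\begin{aligned}
P\left\{\mu t - \sum_{i=1}^{t} X_i \ge t\delta\right\}
&\le e^{-\lambda t\delta}\, \mathbb{E}\left[e^{\lambda\sum_{i=1}^{t}(\mu - X_i)}\right]\\
&\overset{\text{Eq. 48}}{\le} e^{-\lambda t\delta}\, e^{\frac{\lambda^2 h^2}{8}t}\left(1 + c_p \sum_{\tau=1}^{\infty} e^{-\frac{\lambda^2 h^2}{8}\tau}\, e^{\tau(\lambda h - c_e)}\right)\\
&= e^{-\frac{2\delta^2 c_e}{h^2}t}\, e^{\frac{\delta^2 c_e^2}{2h^2}t}\left(1 + c_p \sum_{\tau=1}^{\infty} e^{-\frac{\delta^2 c_e^2}{2h^2}\tau}\, e^{\tau c_e\left(\frac{2\delta}{h} - 1\right)}\right)
&& \text{by setting } \lambda = \tfrac{2\delta c_e}{h^2}\\
&\le e^{-\frac{3\delta^2 c_e}{2h^2}t}\left(1 + c_p \sum_{\tau=1}^{\infty} e^{-\frac{\delta^2 c_e^2}{2h^2}\tau}\right)
&& \text{since } \delta \le \tfrac{h}{2} \text{ and } c_e \le 1\\
&\le e^{-\frac{3\delta^2 c_e}{2h^2}t}\left(1 + \frac{2h^2 c_p}{\delta^2 c_e^2}\right).
\end{aligned}
\tag{52}
$$

The second bound can be proven in much the same way. ■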
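
The auxiliary recursion used in the proof above, f(t) = θ·f(t−1) + g(t), and its unrolled form f(t) = θ^(t−c)·f(c) + Σ_{τ=c+1..t} θ^(t−τ)·g(τ), can be verified numerically. The sketch below is a small illustration with hypothetical function names; it iterates the recursion directly and compares the result with the closed form.

```python
import math
from typing import Callable


def unroll_recursion(theta: float, g: Callable[[int], float],
                     f_start: float, start: int, t: int) -> float:
    """Iterate f(tau) = theta * f(tau - 1) + g(tau) for tau = start + 1, ..., t,
    beginning from f(start) = f_start."""
    value = f_start
    for tau in range(start + 1, t + 1):
        value = theta * value + g(tau)
    return value


def closed_form(theta: float, g: Callable[[int], float],
                f_start: float, start: int, t: int) -> float:
    """theta^(t - start) * f(start) + sum over tau of theta^(t - tau) * g(tau)."""
    return theta ** (t - start) * f_start + sum(
        theta ** (t - tau) * g(tau) for tau in range(start + 1, t + 1))


if __name__ == "__main__":
    g = lambda tau: 0.5 * math.exp(-0.1 * tau)
    print(unroll_recursion(1.05, g, 1.0, 0, 20), closed_form(1.05, g, 1.0, 0, 20))
```

In the proof, the recursion is applied with start = 0 and f(0) = 1, which is what produces the leading term θ^t in the bound marked (*).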

Discussion 14 Note that the above bound was obtained for a particular choice of A that 
maximizes the coefficient term in the exponent. Other choices of A may result in a smaller 
coefficient in the exponent, but also a decreased leading constant. In particular, setting 
A = , for any < j3 < 1, yields the following bound 



Ht-^2Xi>t5 



< e 



1 + 



- S ( 2Sc e l3 \ 



Ce (1 " P) 



r=l 

e 



2ft 2 " 



(53) 



Proof of Lemma 5:

Lemma 4 implies that, with probability approaching 1 exponentially fast, the state-space
samples issued at a level with $h + 1$ steps-to-go are optimal. That is, their expectation
equals the actual Q-value. Therefore, by Lemma 4, we have

$$
P\left\{\mathbb{E}\left[X_{t,h+1}(s,a) \mid X_{1,h+1}(s,a), \dots, X_{t-1,h+1}(s,a)\right] \ne Q_{h+1}(s,a)\right\} \le c_p e^{-c_e t},
\tag{54}
$$

where $c_p = 2Kh(2 + c_h)$ and $c_e = \frac{p c'_h}{6K}$. It is also easy to see that $0 \le X_i \le h + 1$, and thus
the conditions of Lemma 3 are satisfied. In turn, from Lemma 3 for $\delta = \frac{d}{2}$ and random
variables with support $[0, h + 1]$,

$$
\begin{aligned}
P\left\{\hat{Q}_{h+1}(s,a) - Q_{h+1}(s,a) \ge \tfrac{d}{2} \,\middle|\, n_{h+1}(s,a) = t\right\}
&\le \left(1 + c_p \frac{2(h+1)^2}{\left(\frac{d}{2}\right)^2 c_e^2}\right) e^{-\frac{3\left(\frac{d}{2}\right)^2 c_e}{2(h+1)^2}t}\\
&\le e^{-\frac{d^2 p c'_h}{16(h+1)^2 K}t}\left(1152 \cdot \frac{K^3 (h+1)^3 (2 + c_h)}{d^2 p^2 c'^2_h}\right)
\end{aligned}
\tag{55}
$$

and, similarly,

$$
\begin{aligned}
P\left\{\hat{Q}_{h+1}(s,a) - Q_{h+1}(s,a) \le -\tfrac{d}{2} \,\middle|\, n_{h+1}(s,a) = t\right\}
&\le e^{-\frac{d^2 p c'_h}{16(h+1)^2 K}t}\left(1152 \cdot \frac{K^3 (h+1)^3 (2 + c_h)}{d^2 p^2 c'^2_h}\right)\\
&\le e^{-\frac{d^2 p c'_h}{16(h+1)^2 K}t}\left(3456 \cdot \frac{K^3 (h+1)^3 c_h}{d^2 p^2 c'^2_h}\right),
\end{aligned}
\tag{56}
$$

since $2 + c_h \le 3c_h$.

Proof of Lemma 2, induction step:

Note that the proof of Lemma 5 is basically the proof of the induction step for the key part
of Lemma 2, that is, Eq. 14. The only thing that remains to be finalized is the correctness
of Eqs. 15 and 16 for $h + 1$, and these can be verified by substitution of $c_h$ and $c'_h$ in Eq. 56
by the respective expressions (for $h$) from Eqs. 15 and 16. ■

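Proof of Theorem 1:

The proof of our main results follows by using the same techniques as above. Note that,
by the Hoeffding inequality, after $n > 0$ iterations of BRUE, for each action $a \in A(s_0)$, it
holds that

$$
P\left\{n_H(s_0,a) \le \frac{n}{2KH}\right\} \le e^{-\frac{n}{2K^2H}}.
\tag{57}
$$

Given that,

$$
P\left\{\pi^B_n(s_0,H) \ne \pi^*(s_0,H)\right\}
\le \sum_{a \ne \pi^*(s_0,H)} \left[ P\left\{\hat{Q}_H(s_0,a) \ge Q_H(s_0,a) + \tfrac{d}{2}\right\}
+ P\left\{\hat{Q}_H(s_0,\pi^*(s_0,H)) \le Q_H(s_0,\pi^*(s_0,H)) - \tfrac{d}{2}\right\} \right].
\tag{58}
$$

For a sub-optimal action $a$,

$$
\begin{aligned}
P\left\{\hat{Q}_H(s_0,a) \ge Q_H(s_0,a) + \tfrac{d}{2}\right\}
&= \sum_{t=1}^{n} P\left\{\hat{Q}_H(s_0,a) \ge Q_H(s_0,a) + \tfrac{d}{2} \,\middle|\, n_H(s_0,a) = t\right\} P\left\{n_H(s_0,a) = t\right\}\\
&\le P\left\{n_H(s_0,a) \le \tfrac{n}{2KH}\right\}
+ \sum_{t=\frac{n}{2KH}+1}^{n} P\left\{\hat{Q}_H(s_0,a) \ge Q_H(s_0,a) + \tfrac{d}{2} \,\middle|\, n_H(s_0,a) = t\right\} P\left\{n_H(s_0,a) = t\right\}\\
&\overset{\text{Lemma 2}}{\le} e^{-\frac{n}{2K^2H}} + \sum_{t=\frac{n}{2KH}+1}^{n} c_H e^{-c'_H t}\, P\left\{n_H(s_0,a) = t\right\}\\
&\le e^{-\frac{n}{2K^2H}} + c_H e^{-\frac{c'_H}{2KH}n}\\
&\le 2c_H e^{-\frac{c'_H}{2KH}n}.
\end{aligned}
\tag{59}
$$

Using exactly the same line of bounding, we obtain

$$
P\left\{\hat{Q}_H(s_0,\pi^*(s_0,H)) \le Q_H(s_0,\pi^*(s_0,H)) - \tfrac{d}{2}\right\} \le 2c_H e^{-\frac{c'_H}{2KH}n},
\tag{60}
$$

and thus

$$
P\left\{\pi^B_n(s_0,H) \ne \pi^*(s_0,H)\right\} \le 4Kc_H e^{-\frac{c'_H}{2KH}n}.
\tag{61}
$$

Eqs. 11, 12, and 13 of Theorem 1 are then obtained by substitution of $c_H$ and $c'_H$ in Eq. 61
with the respective expressions from Eqs. 15 and 16. In turn, Eq. 10 of Theorem 1 stems
from Eq. 11, horizon $H$, and per-step rewards being in $[0, 1]$. ■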
Appendix B. Proof of Theorem 6 

We first prove the modified Hoeffding-Azuma inequality for partial sums. 



Proof of Lemma 8 (Modified Hoeffding-Azuma inequality for partial sums):

Let $E_t$ be the event that $\mathbb{E}[X_t \mid X_1, \dots, X_{t-1}] = \mu$, and let

$$
Y_t \triangleq X_t - \mu \mid X_1(\omega), \dots, X_{t-1}(\omega).
$$

The random variable $Y_t$ is bounded by $h - \mu \ge Y_t \ge -\mu$, and furthermore, for $\omega \in E_t$,
$\mathbb{E}Y_t = 0$. Therefore, using Fact 13, for all $\omega \in E_t$ and $\lambda \in \mathbb{R}^+$, it holds that



E 



< e" 



(62) 



Moreover, 



E 



= E 



E 



e AEU-rati(f-^i) 



Xi, . . . , X t _i 



t-\<xi] (V~ X i) 



9 AEi =t -rati(f-^i) 



= A( M -X t ) 



< e~s~E 
< < h''e [e 



Ai, . . . , Xt-i 
+ F{^E t }e Xhlat] 



+ E^ 



(63) 



E "- 18 r^; ; „ ( (,-x,) 



_|_ c e A/i[at]-c e t 



(*) A 2 h 2 

< e 



(t-r) \h[ar]~c e T 
r=t-[at] 

by the auxiliary step in Eqs. 49-66 below. 



Considering the recursion 

f(t) = 9f(t-l) + g(t) 
it is easy to verify that, for all < c < t, 

t 



f (t) = 9^/ (c) + £ e^g (r) 



(64) 



(65) 



r=c+l 



Given that, the bound (*) in Eq. 48 is obtained by setting 



e 


A 

= e 


2 h 2 
8 , 


fit) 


= E 


e A£i=iO»-*i)' 


git) 





(66) 
Now, by the Markov inequality, for any $\lambda > 0$,



H\at] - ^2 X { > \at]S 

i=t— \at\ 
-XS\at] m [„AEU_ ratl (M-^)l 



< e 

Eq.48 



2l5 c e r„+i 



e s 



e 2h- 



-\at] + ^ e SM*- r ) . (y A/ ' ~ 



T =t- \at\ 



-\at] 



+ E 

T =t- [at] 



by setting A 



2fc e 
ft 2 



< e h 2 



5_4e r at i ^4 

e^ 1 1 + e^? 



EMC 



T =t- \ a i\ 



< g h 2 1 1 e 2h 2 1 1 

since o < — 



< e fc 2 e 2h 2 1 1 



1 + C p 



3 c e [ar]-c e r 



T =t- \at\ 



t 



i + Cp 



= -Ce(l-a)T 



r=t- Tat] 

Cri 



3<S^e, 

< e W 



- ttt 



1 + 



c e (l - a) 



1 + - p e ~^(i-°) 2 * 
c e (l-a) 



-c e (l-a) 2 t 



(67) 
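
The partial sum above covers only the most recent samples, roughly the last $\lceil\alpha t\rceil$ of them. A minimal sketch of why discarding early samples can help when early samples are biased is given below. The bias profile $1/i$ and the helper name `window_mean` are hypothetical and chosen only for illustration; they are not the sampling process analyzed in the paper.

```python
import math

def window_mean(samples, alpha):
    """Average of the last ceil(alpha * t) entries of a length-t sample list.

    alpha = 1.0 gives the ordinary mean over all samples.
    """
    t = len(samples)
    k = math.ceil(alpha * t)
    return sum(samples[t - k:]) / k

# Hypothetical sequence: the i-th sample equals the target mu plus a bias 1/i
# that vanishes as more samples are collected.
mu = 0.4
samples = [mu + 1.0 / i for i in range(1, 101)]

full_error = abs(window_mean(samples, 1.0) - mu)   # bias of the full average
tail_error = abs(window_mean(samples, 0.5) - mu)   # bias of the last-half average
```

In this toy sequence the tail average is closer to `mu` than the full average, because the heavily biased early samples are excluded. With unbiased samples there is nothing to gain from the shorter window, which matches the remark that $\alpha < 1$ is pointless at leaf nodes.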



Obviously, at leaf nodes there is no point in choosing $\alpha < 1$ since there is no bias.
Therefore, for h = 1 we can use the same constants c\ = 1 and = j^. Since is 

decreasing with h, we have c' h < ^ H -h f° r I < h < H, and thus Lemma 4 is valid. 
Lemma 5 relies on the modified Hoeffding-Azuma inequality, which is no longer valid in the context of BRUE($\alpha$). Instead, we apply its modification, Lemma 8 for partial sums, to prove the induction step



Q h+ i (s,a) - Q h+ i (s,a) > 



n h+ i (s,a) = t 



,2 / 
Ad pcfo 



< e 4SKh 2 



at 



12K 2 h(2 + c h ) K(i-a) 2 ; 

1 H -e L 

pd h (l - a) 



(68) 
and, similarly,



I Qh+i (s, a) - Q h+1 (s, a) < - - 



3cT 



j2 / 



n h+1 (s,a) = I 
] 12K 2 h(2 + c h ) p vAS^f ' 



pd h (l - a) 



6K 



(69) 



The induction step is satisfied, e.g., with c' h+1 = min j j^T , ^^qk^ } anc ^ c h+i = 1 + 

\2K 2 h(2+c h ) 

Since Ch is increasing in h and c' h is decreasing in h, the term 12 t^z^x^ also increases 
in $h$. The larger the constant grows, the more beneficial it might be to increase the exponent coefficient that multiplies that constant by decreasing $\alpha$, at the expense of decreasing the exponent coefficient that multiplies 1. Clearly, the tradeoff also depends on $t$, the number of samples of action $a$ in node $(s,h)$. Therefore, as $h$ increases, smaller values of $\alpha$ would be more appealing.
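
The tradeoff can be illustrated on a stylized two-term bound of the form $e^{-a\alpha t}\left(1 + C e^{-b(1-\alpha)^2 t}\right)$. This shape, the rate constants `a` and `b`, and the grid search are assumptions made for illustration only; they mimic the structure of the bounds above but are not the paper's constants.

```python
import math

def stylized_bound(alpha, t, big_c, a=0.5, b=2.0):
    """Stylized bound exp(-a*alpha*t) * (1 + C * exp(-b*(1-alpha)^2 * t)).

    a, b and the overall shape are hypothetical stand-ins for the real constants.
    """
    return math.exp(-a * alpha * t) * (1.0 + big_c * math.exp(-b * (1.0 - alpha) ** 2 * t))

def best_alpha(t, big_c, grid_size=100):
    """Grid search over alpha in {1/grid_size, ..., 1} for the smallest bound."""
    grid = [k / grid_size for k in range(1, grid_size + 1)]
    return min(grid, key=lambda alpha: stylized_bound(alpha, t, big_c))
```

With `t = 10`, a small constant such as `big_c = 1.0` makes `alpha = 1.0` the best choice on the grid, whereas a large constant such as `big_c = 1e6` moves the minimizer to a much smaller `alpha`. This mirrors the qualitative claim that a growing constant favors smaller values of $\alpha$.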